Kearney Advanced Analytics Case Study - WindFarm


Author: Tim Graf

Date: 24.02.2022

Executive Summary

In this Python notebook I show that an XGBoost model outperforms both a persistence model and a multivariate regression when predicting the summed production of 20 wind farms.

Data Analysis

I've looked at various dimensions to understand the data better.

  1. First, I've realized that the assets are highly correlated with each other and, less strongly, with the wind forecast. This may indicate that the individual farms are located close to each other or that the wind is very similar across the locations.

  2. Second, I've identified yearly seasonality, with summer months producing less than winter months. There is no such pattern in the daily or weekly averages.

  3. Third, there is a clear relationship between wind forecast and energy production. Energy production begins at around 4-7.5 m/s. Production cuts off at around 2.5 MW, probably for safety reasons: a wind farm cannot produce more than this threshold. I also find anomalies where, with wind forecasts above 7.5 m/s, energy should be produced but is not.

  4. Fourth, I've spotted outliers in the form of negative or zero production despite high wind forecasts. This may be for a variety of reasons, such as a sensor malfunction, maintenance or upgrades of the wind production asset, or errors in the wind forecasts.
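Such outliers can be flagged with a simple mask. The sketch below is illustrative: the column names `forecast` and `production` and the 7.5 m/s threshold are assumptions based on the analysis above, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical column names: "forecast" (m/s) and "production" (MW).
df = pd.DataFrame({
    "forecast": [3.0, 8.2, 9.1, 10.5, 6.0],
    "production": [0.1, 1.8, 0.0, -0.2, 1.2],
})

# Flag rows where the forecast is above the ~7.5 m/s threshold
# but production is zero or negative -> likely sensor error or maintenance.
mask = (df["forecast"] > 7.5) & (df["production"] <= 0)
outliers = df[mask]
```

These flagged rows are the candidates for removal in Dataset B below.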

Feature Engineering

| Dataset A | Dataset B | Dataset C | Dataset D |
| --- | --- | --- | --- |
| 2 Wind Forecast | 2 Wind Forecast | 2 Wind Forecast | 2 Wind Forecast |
| 20 Wind Production | 20 Wind Production - Cleaned | 20 Wind Production - Cleaned | 20 Wind Production - Cleaned |
| - | - | Min, Max, Avg for each feature | Min, Max, Avg for each feature |
| - | - | - | Lagged Features & Rolling Means |
| Datetime | Datetime | Datetime | Datetime |

Models

Time Series can be analyzed using multiple methods:

Reasoning for XGBoost

XGBoost has been used and proven to outperform simple persistence models in the context of wind production prediction. See for example:

XGBoost has the benefits that it a) handles missing data, b) works well without normalization of the data, c) finds non-linear relationships, d) is computationally efficient, and e) works well with small datasets compared to neural networks.

Procedure

I use both persistence and a simple linear regression as benchmark models. Then I compare how my XGBoost performs against the benchmarks based on the RMSE (root mean squared error) loss function. Finally, I compare the performance of my XGBoost across the different datasets as inputs. I have tuned the hyperparameters only once, as tuning is computationally very heavy and I want a fair comparison between the individual datasets.

Vector Autoregression

I have begun decomposing the time series to work with vector autoregression. Due to time constraints I have unfortunately not been able to complete it. Nevertheless, I've attached the code in 04_var.py.

Library Import

Data Import & Preprocessing

Production Site 1

All Production Data

Forecast Data

Optional: Skip the computation and import directly computed datasets

Resampling

Data Analysis

This part serves to understand the data better before starting to model.

Plotting of Production & Forecast

Weekly Sum of Production

Conclusion:

Weekly Sums of Forecast

Plots of excerpts

Boxplot Analysis

Box plots are good for an understanding of the distribution.

Scatterplots

All Data

Forecast vs. Wind Power

Conclusion:

Let's analyze this individually for every forecast and production site.

Correlation analysis

Conclusion:

Seasonality

What to look out for:

There are basically two methods to model the seasonality of a time series: additive and multiplicative. The main difference is that the additive model assumes the components combine linearly (observed = trend + seasonal + residual), while the multiplicative model is non-linear (the components multiply).

We find yearly seasonality

We find no monthly seasonality

No Daily Seasonality

Conclusion:

Dickey Fuller Test for Unit Root (optional)

The more negative this statistic, the more likely we are to reject the null hypothesis of non-stationarity (hence indicating that we have a stationary dataset). Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.

Autocorrelation

ACF & PACF Plots

ACF:

PACF:

Ljung Box Test

The Ljung-Box test is a statistical test that checks if autocorrelation exists in a time series.

It uses the following hypotheses:

Feature Engineering

Background

Understanding Wind

Let's examine the wind-to-energy conversion. This tells us that velocity (m/s) is a very important factor in predicting energy production, entering the formula cubed. The swept area of the rotor blades is also very important for the output. I assume that air density and efficiency are relatively similar across all 20 wind production assets given the above analysis.

$Production = 0.5 \cdot \text{area} \cdot \text{velocity}^3 \cdot \text{air density} \cdot \text{efficiency}$
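The cubic dependence on velocity is worth making concrete: doubling the wind speed increases the theoretical output eightfold. The air density and efficiency values below are illustrative defaults, not values taken from the case data.

```python
def wind_power(velocity, area, air_density=1.225, efficiency=0.4):
    """Theoretical power (W): 0.5 * area * velocity^3 * air_density * efficiency.

    air_density (kg/m^3) and efficiency are illustrative assumptions.
    """
    return 0.5 * area * velocity**3 * air_density * efficiency

# Doubling the wind speed increases output eightfold (v^3).
p1 = wind_power(velocity=5.0, area=1000.0)
p2 = wind_power(velocity=10.0, area=1000.0)
```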

Normalizing Data:

XGBoost is not sensitive to monotonic transformations of its features for the same reason that decision trees and random forests are not: the model only needs to pick "cut points" on features to split a node. Splits are not sensitive to monotonic transformations: defining a split on one scale has a corresponding split on the transformed scale.

Dataset A: Production and Forecast Data

This is the raw data which is unprocessed

Dataset B: Cleaned Production from maintenance, malfunction, errors

Why do we have negative power production sometimes?

Reasons for no power production:

Delete all rows on days where the cumulative daily production is negative despite a wind forecast above 7.5 m/s.
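A minimal sketch of this cleaning rule, assuming hypothetical column names `forecast` and `production` and a datetime index:

```python
import pandas as pd

# Toy data; "forecast" in m/s, "production" in MW (assumed names).
df = pd.DataFrame(
    {"forecast": [8.0, 8.0, 8.0, 3.0], "production": [-0.5, -0.7, 1.0, 0.0]},
    index=pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 12:00",
         "2015-01-02 00:00", "2015-01-02 12:00"]
    ),
)

# Cumulative production per calendar day, broadcast back to each row.
daily_sum = df["production"].groupby(df.index.normalize()).transform("sum")

# Drop rows on days where cumulative production is negative
# while the wind forecast is above 7.5 m/s.
cleaned = df[~((daily_sum < 0) & (df["forecast"] > 7.5))]
```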

Dataset C: Add Seasonality as Features

I add the following features:

To avoid forward-looking bias I calculate the values using a rolling window after the first year. I include one year of history so that XGBoost can learn to use these seasonality parameters. This means the first year, which is part of the training data, carries a time bias; for the following years, including the testing data, there is no forward-looking bias.

WARNING: This takes a long time to calculate! I have done the calculation in the separate file 01_datapreprocessing.py and imported the result for efficiency reasons. Feel free to let it run locally.

Dataset D: Add lagged values

Now we can add the lags of the production sums, the averaged forecasts, and rolling means of the past 24h, week, and month. These features have been chosen based on various iterations. After having run XGBoost I saw that the 24h rolling mean was significant in the feature importance plot. Hence I added 7d and 30d windows so the model can learn the seasonality component of the data. This has significantly improved the RMSE:
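These lag and rolling-mean features can be sketched as follows (column name `production_sum` and hourly frequency are assumptions; `shift(1)` keeps every window strictly in the past):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=24 * 40, freq="h")
df = pd.DataFrame({"production_sum": rng.normal(10, 2, size=len(idx))}, index=idx)

# Lagged value and rolling means over the past 24h, 7d, and 30d.
# shift(1) ensures no look-ahead: each window ends at the previous hour.
df["lag_1h"] = df["production_sum"].shift(1)
for window, label in [(24, "24h"), (24 * 7, "7d"), (24 * 30, "30d")]:
    df[f"roll_{label}"] = df["production_sum"].shift(1).rolling(window).mean()
```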

Other

Time series data often requires other cleaning, scaling, and even transformation.

For example:

Modeling

0. Splitting Training and Testing Data

I have split the dataset for all models equally:

The split point has been chosen so that we still have enough data for testing (from Fall 2016 to Spring 2017) and enough for training. This ratio is a common standard for machine learning splits.
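For time series the split must be chronological, never shuffled. A minimal sketch (the date range and the 80/20 ratio are illustrative assumptions):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01", "2017-05-31", freq="D")
df = pd.DataFrame({"y": np.arange(len(idx))}, index=idx)

# Chronological split (no shuffling!) so the test window sits
# entirely after the training window.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```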

1. Persistence Model

This model is the simplest. It predicts the value 4 periods ahead to be today's value: the prediction for t+4 is simply the observation at t.

$\hat{y}_{t+4} = y_{t}$
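In pandas this baseline is a single `shift`:

```python
import pandas as pd

# Persistence baseline: the prediction for t+4 is simply the value at t,
# i.e. shift the series forward by 4 periods.
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_hat = y.shift(4)

# The first 4 predictions are undefined (NaN); afterwards y_hat[t] == y[t-4].
```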

2. Multivariate Regression

I use a multivariate linear regression model where I average the forecasts and the production sum to avoid multicollinearity.

$\hat{y}_{t+4} = c + \beta_1 \cdot forecast_{t} + \beta_2 \cdot production_{t}$

Note: please look in the results file for the results on the testing dataset.
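A sketch of this benchmark with scikit-learn, on synthetic stand-ins for the averaged forecast and production sum (names and data are illustrative, not the case data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
n, horizon = 500, 4

# Toy stand-ins for the averaged forecast and current production sum.
forecast = rng.uniform(2, 12, size=n)
production = 0.3 * forecast**2 + rng.normal(0, 1, size=n)
X = pd.DataFrame({"forecast": forecast, "production": production})

# Target: production sum 4 periods ahead.
y = pd.Series(production).shift(-horizon).dropna()
X = X.iloc[: len(y)]

# Chronological split, fit, and RMSE on the held-out tail.
split = int(len(y) * 0.8)
model = LinearRegression().fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])
rmse = float(np.sqrt(mean_squared_error(y.iloc[split:], pred)))
```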

Conclusion:

3. XG Boost

Note: This is the most complex model and takes time to compute; the hyperparameter tuning in particular takes time. Therefore I have put the code in a separate file named 05_xgboost.py

About XGB

XGBoost (extreme gradient boosting) is a form of Gradient-Tree-Boosting which is computationally efficient through its novel tree learning algorithm for handling sparse data, the use of parallel and distributed computing, cache optimization, and the way it handles weights among other reasons. XGBoost has been used extensively in ML-competitions and research, as it tends to be computationally efficient while leveraging the benefits of gradient boosting and Random Forest.

Hyperparameter Tuning

Each algorithm has its own specific set of parameters. To ensure optimal predictions by the applied ML methods, the parameters were tuned using a fixed or a random grid search. By leveraging a time-series split and validation, various or all combinations of the parameters were tested, and the best among them, the ones minimizing the root mean squared error, were used for the prediction on the test subsample of the initial dataset. Ideally, all possible combinations of parameters should be tested in a "brute-force" procedure, but this quickly becomes computationally intense, so the range of the hyperparameters had to be limited.
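The tuning loop described above can be sketched with scikit-learn's `TimeSeriesSplit` and `RandomizedSearchCV`. To keep this self-contained it uses a sklearn gradient-boosting model as a stand-in; an `xgboost.XGBRegressor` would slot in identically, and the parameter ranges here are illustrative, not the ones actually tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=300)

# TimeSeriesSplit keeps each validation fold strictly after its training fold,
# so the search never evaluates on past data.
cv = TimeSeriesSplit(n_splits=3)

# Random search over a limited grid, minimizing RMSE.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
```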

Evaluation

Loss Function

I have chosen RMSE and not MSE, as I don't want to punish large errors too heavily: RMSE is on the same scale as the data, whereas MSE penalizes large deviations quadratically. Our losses from not hedging correctly are only exponential if we use options; otherwise, if we just export or import energy, we want approximately accurate predictions.

Results (RMSE, adj. R2)

Here I show the aggregated table summarizing the main loss function (RMSE) and the explained variance using adjusted R2. I use adjusted R2 instead of R2 because we incorporate more predictors in every further variant of the dataset. My main conclusions are:

Plots for XGBoost with Dataset D

Scatterplot

Feature Importance

Learning Curve

Predicted vs. Average

Next Steps

I have shown significant outperformance by the XGBoost model combined with feature engineering of the dataset. As next steps I would look at the following:

  1. Test other prediction algorithms (e.g. Prophet by Facebook)
  2. Enrich the dataset with better forecasts (e.g. hourly and more granular forecasts for each location)
  3. Retrain the algorithm every day with the newest results
  4. Include dummies for when we have maintenance, upgrades, or other known events which influence our production